An extensive analysis of user traffic on Gnutella shows a 
significant amount of free riding in the system. By sampling 
messages on the Gnutella network over a 24-hour period, we 
established that almost 70% of Gnutella users share no files, and 
nearly 50% of all responses are returned by the top 1% of sharing 
hosts. Furthermore, we found out that free riding is distributed 
evenly between domains, so that no one group contributes 
significantly more than others, and that peers that volunteer to 
share files are not necessarily those who have desirable ones. We 
argue that free riding leads to degradation of the system 
performance and adds vulnerability to the system. If this trend 
continues, copyright issues might become moot compared to the 
possible collapse of such systems. 
Introduction 


The sudden appearance of new forms of network applications, 
such as Gnutella [Gn00a] and FreeNet [Fr00], holds promise for 
the emergence of fully distributed information sharing systems. 
These systems, inspired by Napster [Na00], will allow users 
worldwide access and provision of information while enjoying a 
level of privacy not possible in the present client-server 
architecture of the Web. 


While a lot of attention has been focused on the issue of free 
access to music and the violation of copyright laws through these 
systems, there remains an additional problem of securing enough 
even though they may not be aware of its existence. 


In a general social dilemma, a group of people attempts to utilize 
a common good in the absence of central authority. In the case of 
a system like Gnutella, one common good is the provision of a 
very large library of files, music and other documents to the user 
community. Another might be the shared bandwidth in the 
system. The dilemma for each individual is then to 


Since files on Gnutella are treated like a public good and the users 
are not charged in proportion to their use, it appears rational for 
people to download music files without contributing by making 
their own files accessible to other users. Because every individual 
can reason this way and free ride on the efforts of others 


The second problem caused by free riding is to 
If only a few individuals contribute to the public good, these few 
peers effectively act as centralized servers. Users in such an 
environment thus become vulnerable to lawsuits, denial of service 
attacks, and potential loss of privacy. This is relevant in light of 
the fact that systems such as Gnutella, Napster, and FreeNet are 
depicted as a means for individuals to 


copyright laws, and providing privacy to individuals. 


Given these concerns, we decided to conduct a set of experiments 
to determine the amount of free riding present in the Gnutella 
system. As we show below, a large proportion of the user 
population, upwards of 70%, enjoy the benefits of the system 
without contributing to its content. 


In what follows we describe the basic architecture of Gnutella and 
the experiments that we performed. We then provide an analysis 
of the data and show ways in which such rampant free riding can 
impact distributed systems. Finally, we propose some mechanisms 
that can counter free riding. 


Adar 


https://ojphi.org/ojs/index.php/Ffm/rt/printerFriendly/... 


Gnutella 


People who wish to use the Gnutella network will download 
[Gn00a] or develop [Gn00b] an application that adheres to the 
Gnutella protocol. This application acts as either a client (a 
consumer of information) or a server (a supplier of information), 
as well as a high-level network, connecting and routing 
information between clients and servers. Each instance of an 
application is called a peer. We will use peer interchangeably with 
host in the following discussion. 


Gnutella boasts a number of features that make it attractive to 
certain users. For example 


Additionally, Gnutella provides the mechanism by which ad-hoc 


Since there are no central servers in the Gnutella network, in 


(although these 

provide shared files). 


Once attached to the network 

peers interact with each other by means of messages. 


receiving and transmitting to neighbors). The messages allowed in the network 
are: 


• Ping Messages - Essentially, an "are you there?" message 
directed at a host. 

• Pong Messages - A reply to a ping ("yes, I'm here"). The pong 
message contains information about the peer such as their 

Peers forward this kind of 
message to their neighbors so that it is possible to later find 
other peers. This is needed in case there is a disconnect in 
the network. 

• Query Messages - These are messages stating, "I am looking 
for x" and can get forwarded throughout the entire network 
(at least theoretically). Query messages are uniquely 

• Query Response Messages - These are replies to query 
messages, and they include the 

(IP, port, and other location information). 

Since these messages are not broadcast, it becomes impossible to 
trace all query responses in the system. 

• Get messages are simply a request for a 
file returned by a query. The requesting peer connects to the 


However 


Several features of Gnutella's protocol prevent messages from 
being re-broadcast indefinitely through the network. One such 
feature includes a 


preventing re-broadcasting). 


Additionally, 
At each hop (re-broadcast) the TTL is decremented. As soon as a 
peer sees a message with a TTL of zero, the message is dropped 
(i.e. it is not re-broadcast). 
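
The loop-prevention rules above can be sketched in a few lines of Python. This is a minimal illustration, not the Gnutella wire protocol: the `Message` and `Peer` classes, the `msg_id` field, the `connect` helper, and the per-peer `seen` set used for duplicate suppression are all assumptions made for the example. Only the TTL decrement at each hop and the drop at a TTL of zero come from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    msg_id: str  # hypothetical unique identifier (assumption, not from the text)
    ttl: int     # hops the message may still travel


@dataclass(eq=False)
class Peer:
    name: str
    neighbors: list = field(default_factory=list)
    seen: set = field(default_factory=set)  # ids already handled (assumed mechanism)

    def originate(self, msg):
        """Start a broadcast: remember the id and send to every neighbor."""
        self.seen.add(msg.msg_id)
        for peer in self.neighbors:
            peer.receive(msg, sender=self)

    def receive(self, msg, sender):
        # A message that arrives with a TTL of zero is dropped, not re-broadcast.
        if msg.ttl <= 0:
            return
        # Assumed duplicate suppression: ignore ids this peer already handled.
        if msg.msg_id in self.seen:
            return
        self.seen.add(msg.msg_id)
        # Each hop (re-broadcast) decrements the TTL.
        forwarded = Message(msg.msg_id, msg.ttl - 1)
        for peer in self.neighbors:
            if peer is not sender:
                peer.receive(forwarded, sender=self)


def connect(a, b):
    """Create a bidirectional link between two peers."""
    a.neighbors.append(b)
    b.neighbors.append(a)
```

In a chain of four peers a, b, c, d, a message originated at a with a TTL of 2 is handled by b and c and dropped when it reaches d. In a ring of peers the `seen` set is what stops the message from circulating until its TTL runs out.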


Free riding in Gnutella 


es of free riding. In the first 


The second definition of free riding considers 


essentially a quantity versus quality argument that also poses a 
social dilemma when there is a cost to the provider to make 
desirable files available to others. In the "old days" of the modem-based 
bulletin board services (BBS), users were required to upload 
files to the bulletin board before they were able to download. In 
response to this requirement, users would upload their own bad 
artwork or randomly generated text files and would be able to 
download high quality content generated by others. In the 
experiments described below we address both kinds of free riding. 


Experiments 


In the following section we describe the experiments used to test 
the following three hypotheses: 


• Hypothesis 1: A significant portion of Gnutella peers are free 
riders. 

• Hypothesis 2: Free riders are distributed evenly across 
different domains (and by speed of their network 
connections). 

• Hypothesis 3: Peers that provide files for download are not 
necessarily those from which files are downloaded. 


Measuring downloads 


One of the features that attract users to Gnutella is the difficulty 

Given a query message it is virtually impossible 

of peers collude) to find the peer that originated the query. The 
unfortunate side effect of this property is to make it impossible to 
experimentally measure the number of queries and files 
downloaded by each client. This forces us to make assumptions 
about downloads in order to measure them. 


In this case, there is no free riding. The other 
possible assumption is that 
Therefore the fewer files a user has the 
more likely he is to download them, resulting in rampant free 
riding. 


Since we unfortunately have no way of knowing which of these 
two extremes is closest to reality, 


Experimental Setup 


In order to perform monitoring experiments on the Gnutella 
network it was necessary to modify a Gnutella client to log 
messages flowing through the system. We elected to use the 
Java-based Furi client [Fu00], which was a full-featured 
implementation with numerous hooks for logging. 


The Furi client was then executed for a 24-hour period over a 
weekend in August of 2000 (Saturday 1pm to Sunday 1pm) [1]. 
During this time period we collected both pong and query 
response messages from normal Gnutella users. A shorter trace 
during a weekday shows results consistent with the weekend 
findings. In the 24-hour period we observed 35,352 hosts issuing 
ping messages, which shared a total of 3,304,046 files. 


One of the difficulties in measuring 


In our study we witnessed 
; address in 
ping messages. e which also utilize a 
unique client identifier (in a address) we saw 937 


ition to an 
While the possible range of 5% to 16% seems high, we find that 


[2]. This leaves us with a final count of 33,335 hosts 
sharing 3,100,464 files. 


Although we could not capture all query response messages it was 
nonetheless possible to sample a wide selection by shifting 
locations (i.e., by reattaching to different hosts) within the 
Gnutella network. Over the 24-hour period, we were thus able to 
capture 87,668 query response messages. 


Results 


Figure 1 illustrates the number of files shared by each of the 
33,335 peers we counted in our measurement. The sites are rank 
ordered (i.e. sorted by the number of files they offer) from left to 
right. These results indicate that 22,084, or approximately 


Rank Ordering of Peers by Number of Files Shared 
Figure 1 


Although NAT allows firewalled hosts to share files, if both the 
sharing host and downloading host have NAT addresses the 
transaction cannot be completed. Thus 


With 5% of hosts using NAT, this is a 
trivial 0.25%. However, as we approach 16% this turns into over 2% 
of transactions. While this 

These probabilities push the 
zero-share statistics up to 69%. 
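
The two quoted figures are consistent with squaring the NAT fraction, that is, with assuming that the sharing host and the downloading host are each behind NAT independently of one another. A quick check under that assumption (the function name is ours, not the paper's):

```python
def blocked_fraction(nat_fraction):
    """Probability that both ends of a transfer are behind NAT,
    assuming the two hosts are picked independently."""
    return nat_fraction ** 2


for p in (0.05, 0.16):
    print(f"{p:.0%} NAT hosts -> {blocked_fraction(p):.2%} of transactions blocked")
```

With 5% of hosts behind NAT this gives 0.25% of transactions, and with 16% it gives about 2.6%, matching the "over 2%" stated above.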


The data also shows that 


Table 1 shows the values of the in-between data points. 


| The top            | Share     | As percent of the whole |
| 333 hosts (1%)     | 1,142,645 | 37%                     |
| 1,667 hosts (5%)   | 2,182,087 | 70%                     |
| 3,334 hosts (10%)  | 2,692,082 | 87%                     |
| 5,000 hosts (15%)  | 2,928,905 | 94%                     |
| 6,667 hosts (20%)  | 3,037,232 | 98%                     |
| 8,333 hosts (25%)  | 3,082,572 | 99%                     |


Table 1 
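
The percentage column of Table 1 can be reproduced from the file counts and the total of 3,100,464 files shared by the 33,335 hosts. The short sketch below does the arithmetic; the constant and function names are ours, and rounding to the nearest whole percent is an assumption about how the column was produced.

```python
TOTAL_FILES = 3_100_464  # files shared by the 33,335 hosts in the final count

# Files shared by the top N hosts, copied from Table 1.
TOP_HOST_FILES = {
    333: 1_142_645,
    1_667: 2_182_087,
    3_334: 2_692_082,
    5_000: 2_928_905,
    6_667: 3_037_232,
    8_333: 3_082_572,
}


def percent_of_whole(files, total=TOTAL_FILES):
    """Share of all files, rounded to the nearest whole percent."""
    return round(100 * files / total)


for hosts, files in TOP_HOST_FILES.items():
    print(f"top {hosts:>5,} hosts: {percent_of_whole(files)}%")
```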


Rank Ordering of Peers by Query Responses 
Figure 2 


As per our second definition of free riding, we determined which 
hosts provide files and which hosts provide files that are actually 
downloaded. We attempted to capture this by analyzing the query 
The difficulty with analyzing this data is that it is 


% we find that after eliminating hosts that provide no 
downloadable files we were left with a set of 11,585 hosts. 
Again, we measured a considerable amount of free riding on the 
Gnutella network. Out of the sample set, 7,349 peers, or 
These were 


Figure 2 illustrates the data by depicting the rank ordering of 
these sites versus the number of query responses each host 
provided. We again see a rapid decline in the responses as a 
function of the rank, indicating that very few sites do the bulk of 
the work. Of the 11,585 sharing hosts the 


Who Shares Files? 


In our second experiment we verified the hypothesis that files and 
query responses (and therefore free riders) are shared equally 
across different domains. The implication is that hosts based in 
domain a do not contribute more than hosts in domain b in terms 
of the ratio of peers on the network to files and responses offered. 
This does not imply that certain domains contribute more or fewer 
total hosts to the network, but simply that free riders are 
distributed equally. Additionally, 
(for example, aol.com hosts tend to operate on 
modems, and rr.com on cable modem connections). Therefore, 


In order to do this analysis we filtered our initial test set to 26,014 
peers. These were hosts with IP addresses that were readily 
converted to host names. We then counted the number of hosts in 
each domain (mit.edu, home.com, etc.) as well as the number of 
hosts in each top-level domain, or TLD (.edu, .com, .net, etc.). 
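
The counting step can be sketched as follows. This is an illustration only: it assumes that a host's domain is the last two labels of its host name and that its TLD is the last label, which is our reading of the examples given (mit.edu, home.com) rather than a documented procedure. The function names and host names are hypothetical.

```python
from collections import Counter


def domain_of(hostname):
    """Last two labels of a host name, e.g. 'dorm.mit.edu' -> 'mit.edu'."""
    labels = hostname.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def tld_of(hostname):
    """Last label of a host name, e.g. 'dorm.mit.edu' -> 'edu'."""
    return hostname.lower().rstrip(".").split(".")[-1]


def count_hosts(hostnames):
    """Return (hosts per domain, hosts per TLD) as two Counters."""
    domains = Counter(domain_of(h) for h in hostnames)
    tlds = Counter(tld_of(h) for h in hostnames)
    return domains, tlds
```

The two-label rule is a simplification. It would, for instance, group every host under a country-code suffix such as co.uk into a single domain.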
Peer Count vs. Responses by TLD 
In our set of hostnames there were 2,538 unique domains. The 
number of peers in each ranged from 1 to a maximum of 2,951. 
Figure 3a above illustrates this data. Each of the points in the 
figure represents a domain in terms of the number of peers (the 
x-axis) and the total number of files shared (the y-axis). The 
dashed line is the trend line for this data. A regression of the two 
dimensions yields an r-squared value of 


Figure 3b depicts the relationship between query responses and 
peer count. Again, a regression on this sample of 1,276 domains 
reveals a fairly linear relationship between the two dimensions 
(with an r-squared of 0.922). We consider this evidence of an even 
distribution of free riders [3]. 


Figures 4a and 4b display the equivalent data sets for TLDs (edu, 
net, org, etc.). Figure 4a represents the 77 top-level domains in 
terms of peer count to the number of files shared. Figure 4b 
represents 61 top-level domains in terms of peer count to query 
responses. Again, there appears to be a linear relationship in both 
figures with the regression fitting with an r-squared of 0.953 and 
0.958 for figures 4a and 4b respectively. 
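
For readers who want to reproduce this kind of fit, the r-squared of a simple least-squares line can be computed as sketched below. The function is ours and is not taken from the study; it relies on the standard identity that, for a regression on a single variable, r-squared equals the squared Pearson correlation. It would be applied to one (peer count, files shared) or (peer count, query responses) pair per domain.

```python
def r_squared(xs, ys):
    """Coefficient of determination for the least-squares line y = a + b*x.

    Raises ZeroDivisionError if either variable is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # With one predictor, r^2 is the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)
```

A value near 1 means that the number of files or responses grows roughly in proportion to the number of peers in a domain, which is how the even distribution of free riders is read off the figures.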


Quality vs. Quantity 


In the final experiment we tested our hypothesis that the number 
of queries answered is not necessarily proportional to the number 
of files offered. This provides a test of the "quality" vs. quantity 
argument. The intuition is that the kinds of queries that are issued 
by 


The files that are returned for these queries are therefore 
more desirable, which defines their quality. Therefore, only a small 
number of peers will actually share anything that is considered to 
be high "quality." 
Files Shared vs. Queries Answered (log-log plot; x-axis: Files Shared, y-axis: Query Responses) 


Figure 5 


We found the degree to which queries are concentrated through a 
separate set of experiments in which we recorded a set of 
202,509 Gnutella queries. The top 1 percent of those queries 
accounted for 37% of the total queries on the Gnutella network. 
The top 25 percent account for over 75% of the total queries. In 
reality these values are even higher due to the equivalence of 
queries ("britney spears" vs. "Spears britney"). 
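
The effect of query equivalence can be illustrated with a small sketch. It treats two queries as the same when they contain the same words regardless of case or order, and it reads "the top 1 percent of queries" as the most popular 1 percent of distinct query strings. Both are our assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def normalize(query):
    """Treat queries with the same words in any order or case as equal."""
    return " ".join(sorted(query.lower().split()))


def top_share(queries, fraction):
    """Fraction of all queries accounted for by the most popular
    `fraction` of distinct (normalized) queries."""
    counts = Counter(normalize(q) for q in queries)
    k = max(1, math.ceil(fraction * len(counts)))
    top = sum(count for _, count in counts.most_common(k))
    return top / len(queries)
```

Under this normalization "britney spears" and "Spears britney" collapse into one entry, so the most popular entries absorb more of the total and the measured concentration can only go up.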


The predicted behavior is present to some extent. For example 
the top respon 

The next most responsive peer hosted 956 files 
and responded to 1,474 queries. 


Figure 5 illustrates the relationship between files hosted (the x-axis) 
and query responses (the y-axis) for 10,510 peers. As is apparent 
from the 


regression analysis yields a very low r-squared value of 0.00105 
for this data. 


Discussion 


Studies of social dilemmas [Gl94] [Hu96] [Hu97] have shown that 


As we have shown in this paper, Gnutella is no exception 
to this finding, and an experimental study of its user patterns 
If distributed systems such as Gnutella rely on voluntary 


Thus, the current debate over copyright might 
become a non-issue when compared to the possible collapse of 
such systems. This collapse can happen because of two factors, 
the tragedy of the digital commons and increased system 
vulnerability, which we now discuss. 


The Tragedy of the Digital Commons 


An ideal analysis of free riding would allow us to calculate the 
contribution provided by individuals in exchange for consumption 
(either in proportion or some fixed cost). There are two ways in 
which individuals on Gnutella can contribute. The first is simply by 
uploading files. 


to the system leading to at least two ways in 
which the quality of the service degrades. 


First, peers that provide files are set to only handle some limited 
number of connections for file download. This limit can essentially 
be considered a bandwidth limitation of the hosts. Now imagine 
that there are only a few hosts that provide responses to most file 
requests (as was illustrated in the results section). As the 
connections to these peers are limited, they will rapidly become 
saturated and remain so, thus preventing the bulk of the 
population from retrieving content from them. 


A second way in which quality of service degrades is through the 
impact of additional hosts on the search horizon. The search 
horizon is the farthest set of hosts reachable by a search request. 
For example, with a time-to-live of five, search messages will 
reach at most peers that are five hops away. Any host that is six 
hops away is unreachable and therefore outside the horizon. 
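
The horizon can be computed with a breadth-first search that stops expanding once the hop limit is reached. The sketch below assumes the overlay is available as an adjacency mapping from each host to its neighbors. That representation and the function name are ours, since no single peer in a real Gnutella network has such a global view.

```python
from collections import deque


def search_horizon(graph, origin, ttl):
    """Hosts reachable from `origin` within `ttl` hops (breadth-first search).

    `graph` maps each host to an iterable of its neighbors.
    """
    distance = {origin: 0}
    queue = deque([origin])
    while queue:
        host = queue.popleft()
        if distance[host] == ttl:
            continue  # messages are not forwarded past the horizon
        for neighbor in graph.get(host, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[host] + 1
                queue.append(neighbor)
    return {host for host in distance if host != origin}
```

On a chain of hosts 0, 1, 2, ... a search from host 0 with a time-to-live of five returns hosts 1 through 5, and host 6 falls outside the horizon.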


Vulnerability 


One argument that has appeared in the popular press regarding 
systems such as Gnutella [Or00] is that there is a diminished risk 
of the system being shut down by either lawsuit or attack. 


by the press, allowed users to believe that they were safe among 


others. T in e of the evidence provided above, 


As we have seen in the experiments, there is a small collection of 
peers that provide the bulk of the shared files and answered 
queries. These few providers act as a rather centralized server 
consisting of several peers and thus the RIAA need not sue all 
users or even the bulk of users. 
(of which there are very few that serve very many). 


Overcoming free riding 


There are many ways of patching Gnutella so that it can 
accommodate the same privacy rules but scale more 
effectively.[5] It is interesting therefore to establish how different 
file-sharing applications rely on technological features to induce 

second cost of the automatic replication as implemented in 
FreeNet is the 


In this wa 


In some ways this feature addresses the 
FreeNet problem because users will only keep "good" files on their 
computers. However, users can easily circumvent this shared 
upload/download directory and frequently do. We have also 
witnessed Napster users misrepresenting the speed of their 
network connections (saying they are on a modem when they are 
on a high speed connection) in order to discourage other users 
from connecting to them. Both systems provide their own set of 
solutions to free riding, but at the cost of introducing other 
problems to their systems. 


Another possible solution to this 


This can be 


accomplished by setting up a 
very 
much in the spirit in which Spawn was created [Wa92]. In this 
context we should stress that the 
For instance, issues of prestige 
or status drive participation in open source systems like Linux 
[Lo00], and the same can be said of SETI@Home [Se00], where 
obviously being the owner of the PC that detects the first intelligent 
signal from outer space would constitute great utility. 


Another alternative for eliminating free riding is to 
For example, the Usenet system, while allowing some degree 
of anonymity, provided a great advantage to individual users as 


That is, the 
only cost to the user was the initial posting; afterwards the 
message was propagated by the system. 


Conclusions 


In this paper we analyzed user traffic in Gnutella and concluded 
that there is a significant amount of free riding in the system. 
Specifically, we found that nearly 70% of Gnutella users share no 
files, and nearly 50% of all responses are returned by the top 1% 
of sharing hosts. Furthermore, we found that free riding is 
distributed evenly between domains, so that no one group 
contributes significantly more than others, and that peers that 
volunteer to share files are not necessarily those who have 
desirable ones. 


These findings have serious implications for the future 
development of Gnutella and its many variants. 


Sometimes, the logic behind the decision to cooperate or not 
changes when the interaction is ongoing, since future expected 
utility gains will join present ones in influencing the rational 
individual's decision. In particular 


[Hu96]. An interesting 
continuation of these experiments may lead to an understanding 
of how free riding changes over time. 
Notes 


1. A much smaller experiment during a weekday revealed that in 
a sample of over 300 hosts 72% share no files, a result 
consistent with our extended study. 


2. NAT hosts shared no files 68.7% of the time, and ten or fewer 
files 74.5% of the time. The top 1% of NAT hosts shared 37.8% of 
the total files, and the top 25% shared 99.4% of the total files. 


3. Of tangential interest may be the top number of hosts sharing 
files. The top 5 domains are (from most to least) home.com, 
rr.com, aol.com, t-dialin.net, and mediaone.net. The top hosts in 
query responses are home.com, rr.com, mediaone.net, ks.us, and 
pacbell.net. 


4. The top five domains for queries in the first-level domain in 
terms of files shared are: net, de, nl, edu, and ca. For queries 
answered they are: com, net, edu, de, and nl. 


5. Hint: Mix one part mailing list, one part anonymous bulletin 
board (see for example [Ch85]), and one part anonymous re- 
mailer (add more re-mailers depending on taste for paranoia). 


6. If a user requests a bad file (say a bomb or Trojan [St00]), this 
file is replicated between all computers from the host uploading to 
the host downloading. 